Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites.
Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites.
Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for each Crawl.
Worldwide Web Crawls are run using the Heritrix software.
In addition, various rules are applied to the logic of each crawl. Those rules define things like the depth the crawler will try to reach for each host (website) it finds. In general terms, the crawling software will identify all the URLs on each page it captures, follow those links, attempt to capture those pages, identify new URLs, follow those links, and so on, until the crawl is stopped or pre-set conditions like site depth limits are reached. For the most part, a given host will only be captured once per Worldwide Web Crawl; however, it might be captured more frequently (e.g. once per hour for various news sites) via other crawls.
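The crawl logic described above can be sketched as a breadth-first traversal with a depth limit. The sketch below is illustrative only: it uses a hypothetical in-memory link graph in place of real HTTP fetches, and the function names and depth values are assumptions, not part of Heritrix (which is a Java application with far richer scoping rules).

```python
from collections import deque

# Hypothetical in-memory link graph standing in for fetched pages;
# a real crawler such as Heritrix would fetch each URL over HTTP
# and extract outlinks from the response.
LINK_GRAPH = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a/1"],
    "http://example.com/b": [],
    "http://example.com/a/1": ["http://example.com/"],  # cycle back to seed
}

def crawl(seed_list, max_depth=2):
    """Breadth-first crawl from a Seed List, honoring a depth limit
    and capturing each URL at most once, as described above."""
    captured = set()
    frontier = deque((url, 0) for url in seed_list)
    while frontier:
        url, depth = frontier.popleft()
        if url in captured or depth > max_depth:
            continue  # already captured, or past the depth limit
        captured.add(url)                       # "capture" the page
        for link in LINK_GRAPH.get(url, []):    # identify new URLs
            frontier.append((link, depth + 1))  # follow them, one level deeper
    return captured

pages = crawl(["http://example.com/"], max_depth=2)
```

The `captured` set is what enforces the "only once per crawl" behavior: a URL reached again through a cycle or a second inbound link is skipped rather than re-fetched.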